A publicly accessible database of UK university website links and a discussion of the need for human intervention in web crawling
Abstract
This paper describes and gives access to a database of the link structures of 109 UK university and higher education college websites, as created by a specialist information science web crawler in June and July of 2001. With information and computer scientists taking an increasing interest in web links, this is an attempt to make available raw data for research that does not rely on the opaque techniques of commercial search engines. Basic tools for querying are also provided. The key issues in running an accurate web crawler are discussed, and access is given to the normally hidden crawler stop list with the aim of making the crawl process more transparent. The necessity of having such a list is discussed, with the conclusion that fully automatic crawling is not socially or empirically desirable because of the existence of database-generated areas of the web and the proliferation of mirroring.
Similar resources
Georeferencing Semi-Structured Place-Based Web Resources Using Machine Learning
In recent years, content shared on the web has grown significantly. A large part of this information is publicly available in the form of semi-structured data. Moreover, a significant amount of this information is related to place. Such information refers to a location on the earth but does not contain any explicit coordinates. In this research, we tried to georefer...
Design and construction of a web database of information resources on indicators for monitoring and evaluating science, technology and innovation
So far, many indicators for evaluation of science, technology and innovation have been presented in various documents in Iran. Also, many indicators have been mentioned in the reports of international organizations. Selection and use of the indicators is difficult for policy makers and researchers because of the abundance and distribution of them in various domestic and international documents ...
Anomaly detection on the web through the creation of access usage profiles
Due to the increase in cyber-attacks, techniques for detecting attacks on web servers have drawn attention. Unfortunately, many available security solutions are inefficient at identifying web-based attacks. The main aim of this study is to detect abnormal web navigation based on web usage profiles. In this paper, comparing the scrolling behavior of a normal user with that of an attacker, and simu...
Evaluation of the Tehran University of Medical Sciences website based on webometric criteria in 2008
Background and Aim: University websites now play an important role in information services, so universities have designed websites to organize and provide access to large volumes of information. This study aimed to evaluate the Tehran University of Medical Sciences website based on webometric criteria in 2008. Materials and Methods: This survey used the link analysis metho...
Designing a web log book _HIS system for the school of dentistry, Tehran university of medical sciences
Background and Aims: This study aimed to collect reports and HIS in a web-based system due to the problems of paper recording of student activities in practical courses, as well as the lack of computers in the departments for observing graphs and treatment plans. Materials and Methods: The initial graphic design of the website was done after the assessment of needs and the necessary planning f...